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The BagIt File Packaging Format (V1.0) 
Abstract 


This document describes BagIt, a set of hierarchical file layout 
conventions for storage and transfer of arbitrary digital content. A 
"bag" has just enough structure to enclose descriptive metadata 
"tags" and a file "payload" but does not require knowledge of the 
payload’s internal semantics. This BagIt format is suitable for 
reliable storage and transfer. 


Status of This Memo 


This document is not an Internet Standards Track specification; it is 
published for informational purposes. 


This is a contribution to the RFC Series, independently of any other 
RFC stream. The RFC Editor has chosen to publish this document at 
its discretion and makes no statement about its value for 
implementation or deployment. Documents approved for publication by 
the RFC Editor are not candidates for any level of Internet Standard; 
see Section 2 of RFC 7841. 


Information about the current status of this document, any errata, 
and how to provide feedback on it may be obtained at 
https://www.rfc-editor.org/info/rfc8493. 
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Copyright Notice 


Copyright (c) 2018 IETF Trust and the persons identified as the 
document authors. All rights reserved. 


This document is subject to BCP 78 and the IETF Trust’s Legal 
Provisions Relating to IETF Documents 
(https://trustee.ietf.org/license-info) in effect on the date of 
publication of this document. Please review these documents 
carefully, as they describe your rights and restrictions with respect 
to this document. 
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1. Introduction 
1.1. Purpose 


BagIt is a set of hierarchical file layout conventions designed to 
support storage and transfer of arbitrary digital content. A "bag" 
consists of a directory containing the payload files and other 
accompanying metadata files known as "tag" files. The "tags" are 
metadata files intended to facilitate and document the storage and 
transfer of the bag. Processing a bag does not require any 
understanding of the payload file contents, and the payload files can 
be accessed without processing the BagIt metadata. 


The name, BagIt, is inspired by the "enclose and deposit" method 
[ENCDEP], sometimes referred to as "bag it and tag it". BagIt 
differs from serialized archival formats such as MIME, TAR, or ZIP in 
two general areas: 


1. Strong integrity assurances. The format supports cryptographic- 
quality hash algorithms (see Section 2.4) and allows for in-place 
upgrades to add additional manifests using stronger algorithms 
without breaking backwards compatibility. This provides high 
levels of confidence against data corruption, but it is not 
designed to be secure against active attacks. 


2. Direct file access. Because BagIt specifies an actual filesystem 
hierarchy rather than a serialized representation of one, files 
can be accessed using standard operating system utilities, 
implementations do not need to process a potentially large 
archival file to extract a subset of data, and the format imposes 
no size limits for either individual files or a bag. 


BagIt is widely used for preserving digital assets originating from 
different domains. Organizations involved in digital preservation 
with BagIt include the Library of Congress, Dryad Data Repository, 
NSF DataONE, and the Rockefeller Archive Center. Software 


implementations are available for many languages, including Python, 
Ruby, Java, Perl, and PHP. It is also used in the libraries of many 
universities, such as Cornell, Purdue, Stanford, Ghent University, 


New York University, and the University of California. 


1.2. Requirements 
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 


"OPTIONAL" in this document are to be interpreted as described in 
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all 
capitals, as shown here. 
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lementers are strongly encouraged to review the interoperability 
siderations described in Section 6.1. 


Terminology 


following terms have precise definitions as used in this 
ument: 


A set of opaque files contained within the structure defined by 
this document. 
declaration: The file required to be in all bags conforming to 
this document. Contains values necessary to process the rest of a 


bag. See Section 2.1.1. 


checksum algorithm: The name of a cryptographic checksum 
algorithm that has been normalized for use in a manifest or tag 
manifest file name (e.g., "sha512") as described in Section 2.4. 


ifest: A tag file that maps filepaths to checksums. A manifest 
can be a payload manifest (see Section 2.1.3) or a tag manifest 
(see Section 2.2.1). 


load: The data encapsulated by the bag as a set of named files, 
which may be organized in subdirectories. The contents of the 
payload files are opaque to this document, and, with respect to 
BagIt processing, are always considered as sequences of 
uninterpreted octets. See Section 2.1.2. 


directory: A directory that contains one or more tag files. 


file: A file that contains metadata about the bag or its 
payload. This document defines the standard BagIt tag files: the 
bag declaration in "bagit.txt" (see Section 2.1.1), payload 


manifests (see Section 2.1.3), tag manifests (see Section 2.2.1), 
bag metadata in "bag-info.txt" (see Section 2.2.2), and remote 
payload in "fetch.txt" (see Section 2.2.3). This document also 


allows other arbitrary tag files as described in Section 2.2.4. 


plete: A bag that contains every element required by this 
document, every payload file listed in a manifest, and any 
optional files that are listed in a tag manifest. See Section 3. 


id: A complete bag where every checksum in every manifest has 
been successfully verified against the corresponding file. 
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2. Structure 
A bag MUST consist of a base directory containing the following: 
1. a set of required and optional tag files (see Section 2.2); 


2. a subdirectory named "data", called the payload directory (see 
Section 2.1.2); and 


3. a set of optional tag directories. 


The tag files in the base directory consist of one or more files 


named "manifest-_algorithm_.txt" (see Sections 2.1.3 and 2.4), a file 
named "bagit.txt" (see Section 2.1.1), and zero or more additional 
tag files (see Section 2.2). The tag files and directories are in 


arbitrary file hierarchies and MAY have any name that is not reserved 
for a file or directory in this document. 


The base directory can have any name, as illustrated by the figure 
below. 


<base directory>/ 
+-- bagit.txt 


+-- manifest-<algorithm>.txt 


+-- [additional tag files] 
+-- data/ 
a [payload files] 
+-- [tag directories] / 
Me [tag files] 


2.1. Required Elements 
2.1.1. Bag Declaration: bagit.txt 


The "bagit.txt" tag file MUST consist of exactly two lines in this 
order: 


BagIt-Version: M.N 
Tag-File-Character-Encoding: ENCODING 
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_M.N_ identifies the BagIt major (M) and minor (N) version numbers. 
_ENCODING_ identifies the character set encoding used by the 


remaining tag files. _ENCODING_ SHOULD be "UTF-8", but for backwards 
compatibility it MAY be any other encoding registered in 
[cs-registry]. The bag declaration itself MUST be encoded in UTF-8 


and MUST NOT contain a Byte Order Mark (BOM) [RFC3629]. 
The number for this version of BagIt is "1.0". 
2.1.2. Payload Directory: data/ 
The base directory MUST contain a subdirectory named "data". 


The payload directory contains the arbitrary digital content within 
the bag. The files under the payload directory are called payload 
files, or the payload. Each payload file is treated as an opaque 
octet stream when verifying file correctness. Payload files MAY be 
organized in arbitrary subdirectory structures within the payload 
directory; however, for the purpose of this document, such 
subdirectory structures and filenames have no given meaning. 


2.1.3. Payload Manifest: manifest-algorithm.txt 


A payload manifest file provides a complete listing of each payload 
file name along with a corresponding checksum to permit data 
integrity checking. A bag can have more than one payload manifest, 
with each using a different checksum algorithm. Manifest entries 
MUST satisfy the following constraints: 


o Every bag MUST contain at least one payload manifest file and MAY 
contain more than one. 


o Every payload manifest MUST list every payload file name exactly 
once. 


O A payload manifest file MUST have a name of the form "manifest-— 
_algorithm_.txt", where _algorithm_ is a string specifying the 
checksum algorithm used by that manifest as described in 
Section 2.4. 


Example payload manifest filenames: 


manifest-—sha256.txt 
manifest-sha512.txt 
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Each line of a payload manifest file MUST be of the form 
checksum filepath 
where _filepath_ is the pathname of a file relative to the base 


directory, and _checksum_ is a hex-encoded checksum calculated by 
applying _algorithm_ over the file. 


o The hex-encoded checksum MAY use uppercase and/or lowercase 
letters. 


o The slash character (’/’) MUST be used as a path separator in 
_filepath_. 


o One or more linear whitespace characters (spaces or tabs) MUST 
separate _checksum_ from _filepath_. 


o There is no limitation on the length of a pathname. 


o The payload manifest MUST NOT reference files outside the payload 


directory. 

o If a _filepath_ includes a Line Feed (LF), a Carriage Return (CR), 
a Carriage-Return Line Feed (CRLF), or a percent sign (%), those 
characters (and only those) MUST be percent-encoded following 
[RFC3986]. 

A manifest MUST NOT reference directories. Bag creators who wish to 

create an otherwise empty directory have typically done so by 

creating an empty placeholder file with a name such as ".keep". 


2.2. Optional Elements 
2.2.1. Tag Manifest: tagmanifest-algorithm.txt 


A tag manifest is a tag file that lists other tag files and checksums 
for those tag files generated using a particular bag checksum 
algorithm. 


A bag MAY contain one or more tag manifests, in which case each tag 
manifest SHOULD list the same set of tag files. 


Each tag manifest MUST list every payload manifest. Each tag 
manifest MUST NOT list any tag manifests but SHOULD list the 
remaining tag files present in the bag. 
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A tag manifest file MUST have a name of the form "tagmanifest- 
_algorithm_.txt", where _algorithm_ is a string following the format 
described in Section 2.4 that specifies the bag checksum algorithm 
used in that manifest. 


Tag manifests SHOULD use the same algorithms as the payload manifests 
that are present in the bag. 


Example tag manifest filenames: 


tagmanifest-sha256.txt 
tagmanifest-sha512.txt 


A tag manifest file has the same form as the payload manifest file 
described in Section 2.1.3 but MUST NOT list any payload files. As a 
result, no _filepath_ listed in a tag manifest begins "data/". 


2.2.2. Bag Metadata: bag-info.txt 


The "bag-info.txt" file is a tag file that contains metadata elements 
describing the bag and the payload. The metadata elements contained 
in the "bag-info.txt" file are intended primarily for human use. All 
metadata elements are OPTIONAL and MAY be repeated. Because "bag- 
info.txt" is intended for human reading and editing, ordering MAY be 
significant and the ordering of metadata elements MUST be preserved. 


A metadata element MUST consist of a label, a colon ":", a single 
linear whitespace character (space or tab), and a value that is 
terminated with an LF, a CR, or a CRLF. 


The label MUST NOT contain a colon (:), LF, or CR. The label MAY 
contain linear whitespace characters but MUST NOT start or end with 
whitespace. 


It is RECOMMENDED that lines not exceed 79 characters in length. 

Long values MAY be continued onto the next line by inserting a LF, 
CR, or CRLF, and then indenting the next line with one or more linear 
white space characters (spaces or tabs). Except for linebreaks, such 
padding does not form part of the value. 


Implementations wishing to support previous BagIt versions MUST 
accept multiple linear whitespace characters before and after the 
colon when the bag version is earlier than 1.0; such whitespace does 
not form part of the label or value. 


Kunze, et al. Informational [Page 9] 


RFC 8493 BagIt October 2018 


The following are reserved metadata elements. The use of these 
reserved metadata elements is OPTIONAL but encouraged. Reserved 
metadata element names are case insensitive. Except where indicated 


otherwise, these metadata element names MAY be repeated to capture 
multiple values. 


Source-Organization: Organization transferring the content. 
Organization-Address: Mailing address of the source organization. 


Contact-Name: Person at the source organization who is responsible 
for the content transfer. 


Contact-Phone: International format telephone number of person or 
position responsible. 


Contact-Email: Fully qualified email address of person or position 
responsible. 


External-—Description: A brief explanation of the contents and 
provenance. 


Bagging-Date: Date (YYYY-MM-DD) that the content was prepared for 
transfer. This metadata element SHOULD NOT be repeated. 


External-Identifier: A sender-supplied identifier for the bag. 


Bag-Size: The size or approximate size of the bag being transferred, 
followed by an abbreviation such as MB (megabytes), GB 
(gigabytes), or TB (terabytes): for example, 42600 MB, 42.6 GB, or 
-043 TB. Compared to Payload-Oxum (described next), Bag-Size is 
intended for human consumption. This metadata element SHOULD NOT 
be repeated. 


Payload-Oxum: The "octetstream sum" of the payload, which is 
intended for the purpose of quickly detecting incomplete bags 
before performing checksum validation. This is strictly an 
optimization, and implementations MUST perform the standard 
checksum validation process before proclaiming a bag to be valid. 
This element MUST NOT be present more than once and, if present, 
MUST be in the form "_OctetCount_._StreamCount_", where 
_OctetCount_ is the total number of octets (8-bit bytes) across 
all payload file content and _StreamCount_ is the total number of 
payload files. This metadata element MUST NOT be repeated. 
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any, of bags to which it logicall 


SHOULD be unique across the sender’s content, 


recognizable as belonging to a gl 


receiver SHOULD make an effort to honor the reference to it. 


metadata element SHOULD NOT be repeated. 


Bag-Count: 


Two numbers separated by "of", 


identifier for the set, if 
y belongs. This identifier 
and if it is 
obally unique scheme, the 
This 
in particular, "N of T", 


where T is the total number of bags in a group of bags and N is 


the ordinal number within the group. 
it as "?" (question mark): 
of 145. This metadata element SHOULD 
metadata element is present, 

Bag-Group-Identifier element. 


Internal-Sender-Identifier: An alternat 


for example, 


If T is not known, specify 
1 of 2, 4 of 4, 3 of ?, 89 
NOT be repeated. If this 


it is RECOMMENDED to also include the 


for the content and/or bag. 


A sender-local 


sender-specific identifier 


Internal-Sender-—Description: 
contents and provenance. 


In addition to these metadata elements, 
elements MAY also be present. 


An example of "bag-info.txt" 


Source-Organization: 
Organization-Address: 
Contact-Name: Jane Doe 
Contact—Phone: +1 111-111-1111 
Contact-—Email: example@example.com 
External—Description: 
FOO papers colle... 
Bagging-Date: 2008-01-1 
External-Identifier: university_foo_001 
Payload-Oxum: 279164409832.1198 
Bag-Group-Identifier: university_foo 
Bag-Count: 1 of 15 
Internal-Sender-Identifier: 


FOO University 
1 Main St., 


Cupertino, 


explanation of the 


other arbitrary metadata 


file is as follows: 


California, 11111 


Uncompressed greyscale TIFF images from the 


/storage/images/foo 


Internal-Sender-—Description: 
from microfilm and are... 
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2.2.3. Fetch File: fetch txt 


For reasons of efficiency, a bag MAY be sent with a list of files to 
be fetched and added to the payload before it can meaningfully be 


checked for completeness. The fetch file allows a bag to be 
transmitted with "holes" in it, which can be practical for several 
reasons. For example, it obviates the need for the sender to stage a 


large serialized copy of the content while the bag is transferred to 
the receiver. Also, this method allows a sender to construct a bag 
from components that are either a subset of logically related 
components (e.g., the localized logical object could be much larger 
than what is intended for export) or assembled from logically 
distributed sources (e.g., the object components for export are not 
stored locally under one filesystem tree). An OPTIONAL tag file, 
called the fetch file, contains such a list. 


The fetch file MUST be named "fetch.txt". Every file listed in the 
fetch file MUST be listed in every payload manifest. A fetch file 
MUST NOT list any tag files. 


Each line of a fetch file MUST be of the form 
url length filepath 


where _url_ identifies the file to be fetched and MUST be an absolute 


URI as defined in [RFC3986], _length_ is the number of octets in the 
file (or "-", to leave it unspecified), and _filepath_ identifies the 
corresponding payload file, relative to the base directory. 


The slash character (’/’) MUST be used as a path separator in 
_filepath_. One or more linear whitespace characters (spaces or tabs) 
MUST separate these three values, and any such characters in the 


_url_ MUST be percent-encoded [RFC3986]. If _filename_ includes an 
LF, a CR, a CRLF, or a percent sign (%), those characters (and only 
those) MUST be percent-encoded as described in [RFC3986]. There is 


no limitation on the length of any of the fields in the fetch file. 
2.2.4. Other Tag Files 

A bag MAY contain other tag files that are not defined by this 

document. Implementations MUST perform standard checksum validation 


on any tag file that is listed in a tag manifest but MUST otherwise 
ignore their contents. 
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2.3. Text Tag File Format 


All tag files specifically described in this document MUST adhere to 
the text tag file format described below. Other tag files MAY adhere 
to the text tag file format described below. 


Text tag files are line oriented, and each line MUST be terminated by 
an LF, a CR, or a CRLF. It is RECOMMENDED that the last line ina 
tag file also end with LF, CR, or CRLF. Text tag file names MUST end 
in the extension ".txt". 


In all text tag files except for the bag declaration file, text MUST 
use the character encoding specified in the "bagit.txt" bag 
declaration file. Text tag files except for the bag declaration file 
MAY include a Byte Order Mark (BOM) only if the specified encoding 
requires it for proper decoding. In accordance with [RFC3629], when 
"bagit.txt" specifies UTF-8, the tag files MUST NOT begin with a BOM. 
See Section 2.1.1. 


The use of UTF-8 for text tag files is strongly RECOMMENDED. A 
future version of BagIt may disallow encodings other than UTF-8. 


2.4. Bag Checksum Algorithms 


The payload manifest and tag manifest permit validating the integrity 
of the payload and tag files in a bag produced by the checksum 
algorithms. Checksum values MUST be encoded so as to conform to the 
manifest format specified in Section 2.1.3. However, the internal 
details of a checksum are outside the scope of this document. 


(aa 


To avoid future ambiguity, the checksum algorithm SHOULD be 
registered in IANA’s "Named Information Hash Algorithm Registry" 
ni-registry] according to [RFC6920] but MAY, for backwards 
compatibility, also be MD5 [RFC1321] or SHA-1 [RFC3174]. 


The name of the checksum algorithm MUST be normalized for use in the 
manifest’s filename by lowercasing the common name of the algorithm 
and removing all non-alphanumeric characters. Following is a partial 
list that maps common algorithm names to normalized names: 


o MD5: mdd 
o SHA-1: shal 
o sha-256: sha256 


o sha-512: sha512 
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Starting with BagIt 1.0, bag creation and validation tools MUST 
support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD 
enable SHA-512 by default when creating new bags. For backwards 
compatibility, implementers SHOULD support MD5 [RFC1321] and SHA-1 
[RFC3174]. Implementers are encouraged to simplify the process of 
adding additional manifests using new algorithms to streamline the 
process of in-place upgrades. 


3. Complete and Valid Bags 


A _complete_ bag MUST meet the following requirements: 


1. Every required element MUST be present (s Section 2.1). 


2. Every fil listed in every tag manifest MUST be present. 


3. Every file listed in every payload manifest MUST be present. 


4. For BagIt 1.0, every payload file MUST be listed in every payload 
manifest. Note that older versions of BagIt allowed payload 
files to be listed in just one of the manifests. 


5. Every element present MUST conform to BagIt 1.0. 

A _valid_ bag MUST meet the following requirements: 

1. The bag MUST be _complete_. 

2. Every checksum in every payload manifest and tag manifest has 


been successfully verified against the contents of the 
corresponding file. 


Kunze, et al. Informational [Page 14] 


RFC 8493 BagIt October 2018 


4. Examples 
4.1. Example of a Basic Bag 


This is the layout of a basic bag containing an image and a companion 
Optical Character Recognition (OCR) file. Lines of file content are 
shown with added parentheses to indicate each complete line. For 
brevity, this example uses MD5 rather than the recommended SHA-512. 


myfirstbag/ 


manifest-md5.txt 
(49afbd86alca9f34b677a3f09655eae9 data/27613-h/images/q172.png) 
(408ad21d50cef31lda4df6d9ed81b01a7 data/27613-h/images/q172.txt) 


bagit.txt 
(BagIt-version: 1.0 ) 
(Tag-File-Character-Encoding: UTF-8 ) 


\--- data/ 


27613-h/images/ql72.png 
(... image bytes ... ) 


27613-h/images/ql172.txt 
Cour: (OCR TExt siz ) 


Kunze, et al. Informational [Page 15] 


RFC 8493 BagIt October 2018 


4.2. Example Bag Using fetch.txt 


This is the layout of a bag that expects the receiver to download the 
files listed in the payload manifests prior to validation. Lines of 
file content are shown with added parentheses to indicate each 
complete line. For brevity, this example uses MD5 rather than the 
recommended SHA-512. 


highsmith-tahoe/ 


manifest-md5.txt 
(102b0e6effe208ef9b29864946de9e22 data/23364a.tif ) 


fetch.txt 
(https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif 
216951362 data/23364a.tif ) 


bagit.txt 
(BagIt-version: 1.0 ) 
(Tag-File-Character-Encoding: UTF-8 ) 


bag-info.txt 
(Internal-Sender-—Description: Download link found at ) 
( https://www.loc.gov/resource/highsm.23364/ ) 


5. Security Considerations 
5.1. Special Directory Characters 


The paths specified in the payload manifests, tag manifests, and 
fetch files do not prohibit special directory characters that have 
special meaning on some operating systems. Implementers MUST ensure 
that files outside the bag directory structure are not accessed when 
reading or writing files based on paths specified in a bag. 


All implementations SHOULD have a test suite to guard against special 
directory characters. 


For example, a maliciously crafted "tagmanifest-—sha512.txt" file 
might contain entries that begin with a path character such as "/", 
"..", or a "“username" home directory reference in an attempt to 
cause a naive implementation to leak or overwrite targeted files ona 
POSIX operating system. 


Windows implementations SHOULD test their implementations to ensure 
that safety checks prevent use of drive letters and the less commonly 
used namespace sequences (e.g., "\\?\C:\...") described in [MSFNAM]. 
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To assist implementers, the Library of Congress conformance suite 
[LC-CONFORMANCE-SUITE] has some tests for invalid bags that are 
expected to fail on POSIX or Windows clients. 


-2. Control of URLs in fetch.txt 


Implementers of tools that complete bags by retrieving URLs listed in 
a fetch file need to be aware that some of those URLs might point to 
hosts, intentionally or unintentionally, that are not under control 
of the bag’s sender. Moreover, older checksum algorithms, even if 
reasonable for detecting corruption during transit, may not offer 
strong cryptographic protection against intentional spoofing. 


.3. File Sizes in fetch.txt 


The size of files, as optionally reported in the fetch file, cannot 
be guaranteed to match the actual file size to be downloaded. 
Implementers SHOULD take steps to monitor and abort transfer when the 
received file size exceeds the file size reported in the fetch file. 
Implementers SHOULD NOT use the file size in the fetch file for 
critical resource allocation, such as buffer sizing or storage 
requisitioning. 


4. Attacks on Payload File Content 


The integrity assurance provided by manifests is designed to provide 
high levels of confidence against data corruption but is not designed 
to be secure against active attacks. Organizations that need to 
secure bags against such threats SHOULD agree on additional measures, 
such as digital signatures, that are out of scope for this 
specification. 


Practical Considerations (Non-normative) 


.1. Interoperability 


This section lists practical considerations for implementers and 
users. None of the points below are required, but they are 
recommended for general-purpose usage. 


Upon discovering errors in bags, an implementation is free to take 
action (for example, logging or reporting) in an application-specific 
manner. This document does not mandate any particular action. 


The Library of Congress conformance suite [LC-CONFORMANCE-SUITE] is 
provided as a public resource to test new implementations for 
compatibility and error handling. 
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various challenges 
filesystems, and 
followed by a list of 


recommendations for implementers in Section 6.1.1.3. 


6.1.1.1. Case Sensitivity 
There are thr 
case: 


o Filesystems such as File Allocation Table 


Allocation Table 
"example.txt" will be stored as 


o Many Unix filesystems save 
allows multiple files that 
"Example.txt" are separate 


o New Technology File System 
System (HFS) Plus 
case insensitiv 


challenges for interoperability 


filenames exactly as provided, 
differ only in case: 
files. 


(NTFS) 
usually preserve case when storing files but are 
when retrieving them. 


related to filename 


(FAT) or Extended File 


(EXFAT) always convert filenames to uppercase: 


"EXAMPLE.TXT". 


which 
"example.txt" and 


and Apple’s Hierarchical File 


A file saved as 


"Example.txt" will be retrieved by that name but will also be 


retrieved as "EXAMPLE.TXT", 


6.1.1.2. Unicode Normalization 


"example.txt", 


etc. 


The Unicode specification has common cases where different character 


sequences produce the same human-meaningful text. 


to as "canonical 
different normalization forms - 
details. 


The example below shows the common surname 


different forms. 


These are referred 


ly equivalent" and the Unicode specification defines 
see 


[UNICODE-TR15] for the full 


"Nunez" normalized in 


Normalization Form D (Decomposition) 
Char UTF8 Hex Name 
N 4e LATIN CAPITAL LETTER N 
u 75 LATIN SMALL LETTER U 
\u0301 cc81 COMBINING ACUTE ACCENT 
n 6e LATIN SMALL LETTER N 
\u0303 cc83 COMBINING TILDE 
e 65 LATIN SMALL LETTER E 
Z 7a LATIN SMALL LETTER Z 
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Normalization Form C (Canonical Composition) 


Char UTF8 Hex Name 

N 4e LATIN CAPITAL LETTER N 

u c3ba LATIN SMALL LETTER U WITH ACUTE 
n c3b1 LATIN SMALL LETTER N WITH TILDE 
e 65 LATIN SMALL LETTER E 

Z 7a LATIN SMALL LETTER Z 


Unicode normalization is relevant to BagIt implementors because 
different systems have different standards for normalization: 


o Apple’s HFS Plus filesystem always normalizes filenames to a fully 
decomposed form based on the Unicode 2.0 specification (see 
[TN1150]). 


o Windows treats filenames as opaque character sequences (see 
[MSFNAM]) and will store and return the encoded bytes exactly as 
provided. 


o Linux and other common Unix systems are generally similar to 
Windows in storing and returning opaque byte streams, but this 
behavior is technically dependent on the filesystem. 


o Utilities used for file management, transfer, and archiving may 
ignore this issue, apply an arbitrary normalization form, or allow 
the user to control how normalization is applied. 


In practice, this means that the encoded filename stored ina 
manifest may fail a simple file existence check because the 
filename’s normalization was changed at some point after the manifest 
was written. This situation is very confusing for users because the 
filenames are visually indistinguishable, and the "missing" file is 
obviously present in the payload directory. 


6.1.1.3. Recommendations 


o Implementations SHOULD discourage the creation of bags containing 
files that differ only in case. 


o Implementations SHOULD prevent the creation of bags containing 
files that differ only in normalization form. 


o BagIt implementations SHOULD tolerate differences in normalization 
form by comparing both the list of filesystem and manifest names 
after applying the same normalization form to both. 
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o Implementations SHOULD issue a warning when multiple manifests are 
present that differ only in case or normalization form. 


6.1.2. Windows and Unix File Naming 


As specified above, only the Unix-based path separator (’/’) may be 
used inside filenames listed in BagIt manifest and fetch.txt files. 
When bags are exchanged between Windows and Unix platforms, the path 
separator SHOULD be translated as needed. Receivers of bags on 
physical media SHOULD be prepared for filesystems created under 
either Windows or Unix. Besides the fundamental difference between 
path separators (’\’ and '/’), generally, Windows filesystems have 
more limitations than Unix filesystems. 


Windows path names have a maximum of 255 characters, and none of 
these characters may be used in a path component: 


RE A p aoe 


Windows also reserves the following names, with or without a file 
extension: 


CON, PRN, AUX, NUL 
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9 
LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9 


See [MSFNAM] for more information and possible alternatives. 
6.1.3. Legacy Checksum Tools 


Some bags have been manually assembled using checksum utilities such 
as those contained in the GNU Coreutils package (md5sum, shalsum, 
etc.), collectively referred to here as "md5sum". Implementers who 
desire wide support of legacy content should be aware of some known 
quirks of these tools. 


md5sum can be run in "text mode", which causes it to normalize line 
endings on some operating systems. On Unix-like systems, both modes 
will usually produce the same results; on systems like Windows, they 
can produce different results based on the file contents. The md5sum 
output format has two characters between the checksum and the 
filepath: the first is always a space, and the second is an asterisk 
("*") for binary mode and a space for text mode. 


A final note about md5sum-generated manifests is that, fora 
_filepath_ containing a backslash (V), the manifest line will have 
a backslash inserted in front of the _checksum_ and, under Windows, 
the backslashes inside _filepath_ can be doubled. 
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Implementers MAY wish to accept this format by ignoring a leading 
asterisk or handling differences in line termination gracefully but, 
if so, implementations MUST warn the user that the bag in question 
will fail strict validation. In such cases, it is RECOMMENDED that 


tools provide an easy option to update the bag with valid manifests. 


7. Augmented Backus-Naur Form (Non-normative) 


The Augmented Backus-Naur Form (ABNF) rules provided below are non- 


normative. If there is a discrepancy between requirements in the 
normative sections and the ABNF, the requirements in the normative 
sections prevail. Some definitions use the core rules (e.g., DIGIT, 


HEXDIG, etc) as defined in [RFC5234]. 


7.1. Bag Declaration: bagit.txt 


bagit.txt ABNF rules: 


bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending 
"Tag-File-Character-Encoding: " encoding ending 

encoding = 1*CHAR 

ending = CR / LF / CRLF 


7.2. Payload Manifest: 


manifest-algorithm.txt 


Payload Manifest ABNF rules: 


payload-manifest 
payload-manifest-—line 
checksum 

case-hexdig 


filepath 


unreserved 
sub-delims 


pct-—encoded 
ending 


Kunze, et al. 


1*payload-manifest-line 

checksum 1*WSP filepath ending 
1*case-hexdig 

DIGIT f "an / wan / "Ba / Wp" / non / Teon / 
npr / mq" / wR / tany LL i / aie ee 


"data/" 

1*( unreserved / pct-encoded / sub-delims ) 
ALPHA / DIGIT / wow / wow / se J "n~n 

" ! " / " $ " 7. " & " / DQUOTE / wrw iA " ( " / " ) " ff 
wee / win / wm / wm / wow / "y/n 


"sop" / "20a" / "SOA" / "Z0a" / woa5" 
CR / LF / CRLF 
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Bag Metadata: bag-info.txt 


bag-info.txt ABNF rules: 


met 


metadata-line 


key 
val 
con 


non-reserved 


adata 


ue 
tinuation 


ending 


7.4. 


1*metadata-line 

key ":" WSP value ending * (continuation ending) 

= 1*non-reserved 

= 1*non-reserved 

= WSP 1*non-reserved 

= VCHAR / WSP 
; any valid character for the specific encoding 
; except those that match "ending" 

= CR / LF / CRLF 


Fetch File: fetch.txt 


fetch.txt ABNF rules: 


fet 
fet 
url 
len 
fil 


ch = 
ch-line 


gth = 
epath = 


ending = 


1*fetch-line 
url 1*WSP length 1*WSP filepath ending 
<absolute-URI, see [RFC3986], Section 4.3> 
1*DIGIT / "-" 

("data/" 

1*( unreserved / pct-encoded / sub-delims )) 
CR / LF / CRLF 


8. IANA Considerations 


This document has no IANA actions. 


9. Re 


9.1. 


[cs-registry] 


[ni-registry] 


[RFC1321] 


Kunze, 


ferences 


Normative 
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